Abstract

We consider approximate dynamic programming for the infinite-horizon stationary $\gamma$-discounted optimal control problem formalized by Markov Decision Processes. While in the exact case it is known that there always exists an optimal policy that is stationary, we show that when using value function approximation, looking for a non-stationary policy may lead to a better performance guarantee. We define a non-stationary variant of MPI that unifies a broad family of approximate DP algorithms of the literature. For this algorithm we provide an error propagation analysis in the form of a performance bound of the resulting policies that can improve the usual performance bound by a factor $O(1-\gamma)$, which is significant when the discount factor $\gamma$ is close to 1. Doing so, our approach unifies recent results for Value and Policy Iteration. Furthermore, we show, by constructing a specific deterministic MDP, that our performance guarantee is tight.
1 Introduction

We consider a discrete-time dynamic system whose state transition depends on a control. We assume that there is a state space $X$. When at some state, an action is chosen from a finite action space $A$. The current state $x \in X$ and action $a \in A$ characterize, through a homogeneous probability kernel $P(dx|x,a)$, the next state's distribution. At each transition, the system is given a reward $r(x,a,y) \in \mathbb{R}$, where $r : X \times A \times X \to \mathbb{R}$ is the instantaneous reward function. In this context, we aim at determining a sequence of actions that maximizes the expected discounted sum of rewards from any starting state $x$:

$$\mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^k r(x_k,a_k,x_{k+1}) \;\middle|\; x_0 = x,\ x_{t+1} \sim P(\cdot|x_t,a_t),\ a_0, a_1, \dots\right],$$

where $0 < \gamma < 1$ is a discount factor. The tuple $(X, A, P, r, \gamma)$ is called a Markov Decision Process (MDP) and the associated optimization problem infinite-horizon stationary discounted optimal control (Puterman, 1994; Bertsekas and Tsitsiklis, 1996).

An important result of this setting is that there exists at least one stationary deterministic policy, that is a function $\pi : X \to A$ that maps states into actions, that is optimal (Puterman, 1994). As a consequence, the problem is usually recast as looking for the stationary deterministic policy $\pi$ that maximizes, for every initial state $x$, the quantity

$$v_\pi(x) := \mathbb{E}\left[\sum_{k=0}^{\infty}\gamma^k r(x_k,\pi(x_k),x_{k+1}) \;\middle|\; x_0 = x,\ x_{t+1} \sim P_\pi(\cdot|x_t)\right],$$

also called the value of policy $\pi$ at state $x$, and where we wrote $P_\pi(dx|x)$ for the stochastic kernel $P(dx|x,\pi(x))$ that chooses actions according to policy $\pi$. We shall similarly write $r_\pi : X \to \mathbb{R}$ for the function giving the immediate reward while following policy $\pi$:

$$\forall x,\quad r_\pi(x) = \mathbb{E}\left[r(x_0,\pi(x_0),x_1) \mid x_0 = x,\ x_1 \sim P_\pi(\cdot|x_0)\right].$$

Two linear operators are associated with the stochastic kernel $P_\pi$: a left operator on functions

$$\forall f \in \mathbb{R}^X,\ \forall x \in X,\quad (P_\pi f)(x) = \int f(y)\,P_\pi(dy|x) = \mathbb{E}\left[f(x_1) \mid x_0 = x,\ x_1 \sim P_\pi(\cdot|x_0)\right],$$

and a right operator on distributions:

$$\forall \mu,\quad (\mu P_\pi)(dy) = \int P_\pi(dy|x)\,\mu(dx).$$

In words, $(P_\pi f)(x)$ is the expected value of $f$ after following policy $\pi$ for a single time-step starting from $x$, and $\mu P_\pi$ is the distribution of states after a single time-step starting from $\mu$.

Given a policy $\pi$, it is well known that the value $v_\pi$ is the unique solution of the following Bellman equation:

$$v_\pi = r_\pi + \gamma P_\pi v_\pi.$$

In other words, $v_\pi$ is the fixed point of the affine operator $T_\pi v := r_\pi + \gamma P_\pi v$. The optimal value starting from state $x$ is defined as

$$v_*(x) := \max_\pi v_\pi(x).$$

It is also well known that $v_*$ is characterized by the following Bellman equation:

$$v_* = \max_\pi\,(r_\pi + \gamma P_\pi v_*) = \max_\pi T_\pi v_*,$$

where the max operator is componentwise. In other words, $v_*$ is the fixed point of the nonlinear operator $Tv := \max_\pi T_\pi v$. For any value vector $v$, we say that a policy $\pi$ is greedy with respect to the value $v$ if it satisfies:

$$\pi \in \arg\max_{\pi'} T_{\pi'} v$$

or equivalently $T_\pi v = Tv$. We write, with some abuse of notation,¹ $G(v)$ any policy that is greedy with respect to $v$. The notions of optimal value function and greedy policies are fundamental to optimal control because of the following property: any policy $\pi_*$ that is greedy with respect to the optimal value is an optimal policy and its value $v_{\pi_*}$ is equal to $v_*$.

Given an MDP, we consider approximate versions of the Modified Policy Iteration (MPI) algorithm (Puterman and Shin, 1978). Starting from an arbitrary value function $v_0$, MPI generates a sequence of value-policy pairs

$$\pi_{k+1} = G(v_k) \qquad \text{(greedy step)}$$

$$v_{k+1} = (T_{\pi_{k+1}})^{m+1} v_k + \epsilon_k \qquad \text{(evaluation step)}$$

where $m \ge 0$ is a free parameter. At each iteration $k$, the term $\epsilon_k$ accounts for a possible approximation in the evaluation step. MPI generalizes the well-known dynamic programming algorithms Value Iteration (VI) and Policy Iteration (PI) for values $m = 0$ and $m = \infty$, respectively. In the exact case ($\epsilon_k = 0$), MPI requires less computation per iteration than PI (in a way similar to VI) and enjoys the faster convergence (in terms of number of iterations) of PI (Puterman and Shin, 1978; Puterman, 1994).
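As an illustration of the greedy and evaluation steps, here is a minimal Python sketch of exact MPI (all error terms set to zero) on a finite MDP. The tabular layout `P[a][x][y]` and `r[a][x]` (expected immediate reward), the function names and the two-state example are assumptions made for this sketch; they are not taken from the paper.

```python
def q_value(P, r, v, gamma, x, a):
    """Expected one-step return of action a in state x, bootstrapping on v."""
    return r[a][x] + gamma * sum(p * v[y] for y, p in enumerate(P[a][x]))


def greedy(P, r, v, gamma):
    """Greedy step: one policy among those greedy w.r.t. v (ties -> lowest index)."""
    actions = range(len(P))
    return [max(actions, key=lambda a: q_value(P, r, v, gamma, x, a))
            for x in range(len(v))]


def apply_T_pi(P, r, policy, v, gamma):
    """Apply the Bellman operator once: T_pi v = r_pi + gamma * P_pi v."""
    return [q_value(P, r, v, gamma, x, policy[x]) for x in range(len(v))]


def mpi(P, r, gamma, m, n_iter, v0):
    """Exact MPI (zero errors); m = 0 corresponds to Value Iteration."""
    v, policy = list(v0), None
    for _ in range(n_iter):
        policy = greedy(P, r, v, gamma)            # greedy step
        for _ in range(m + 1):                     # evaluation step
            v = apply_T_pi(P, r, policy, v, gamma)
    return policy, v


# Hypothetical two-state example: action 0 stays, action 1 switches state;
# staying in state 1 pays 1, every other transition pays 0.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
r = [[0.0, 1.0], [0.0, 0.0]]
```

On this toy problem with $\gamma = 0.9$, both `mpi(P, r, 0.9, 0, 300, [0.0, 0.0])` and `mpi(P, r, 0.9, 5, 60, [0.0, 0.0])` should return the policy that switches in the first state and stays in the second, with values close to 9 and 10.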

It was recently shown that controlling the errors $\epsilon_k$ when running MPI is sufficient to ensure some performance guarantee (Scherrer and Thiery, 2010; Scherrer et al., 2012a,b; Canbolat and Rothblum, 2012). For instance, we have the following performance bound, which is remarkably independent of the parameter $m$.



¹ There might be several policies that are greedy with respect to some value $v$.



Theorem 1 (Scherrer et al. (2012a, Remark 2)). Consider MPI with any parameter $m \ge 0$. Assume there exists an $\epsilon > 0$ such that the errors satisfy $\|\epsilon_k\|_\infty \le \epsilon$ for all $k$. Then, the loss due to running policy $\pi_k$ instead of the optimal policy $\pi_*$ satisfies

$$\|v_* - v_{\pi_k}\|_\infty \le \frac{2(\gamma-\gamma^k)}{(1-\gamma)^2}\,\epsilon + \frac{2\gamma^k}{1-\gamma}\,\|v_* - v_0\|_\infty.$$

In the specific case corresponding to VI ($m = 0$) and PI ($m = \infty$), this bound matches performance guarantees that have been known for a long time (Singh and Yee, 1994; Bertsekas and Tsitsiklis, 1996). The constant $\frac{2\gamma}{(1-\gamma)^2}$ can be very big, in particular when $\gamma$ is close to 1, and consequently the above bound is commonly believed to be conservative for practical applications. Unfortunately, this bound cannot be improved: Bertsekas and Tsitsiklis (1996, Example 6.4) showed that the bound is tight for PI, Scherrer and Lesner (2012) proved that it is tight for VI,² and we will prove in this article³ the (to our knowledge unknown) fact that it is also tight for MPI. In other words, improving the performance bound requires changing the algorithms.

2 Main Results 

Even though the theory of optimal control states that there exists a stationary policy that is optimal, Scherrer and Lesner (2012) recently showed that the performance bound of Theorem 1 could be improved in the specific cases $m = 0$ and $m = \infty$ by considering variations of VI and PI that build periodic non-stationary policies (instead of stationary policies). In this article, we consider an original MPI algorithm that generalizes these variations of VI and PI (in the same way the standard MPI algorithm generalizes standard VI and PI). Given some free parameters $m \ge 0$ and $\ell \ge 1$, an arbitrary value function $v_0$ and an arbitrary set of $\ell - 1$ policies $\pi_0, \pi_{-1}, \dots, \pi_{-\ell+2}$, consider the algorithm that builds a sequence of value-policy pairs as follows:

$$\pi_{k+1} = G(v_k) \qquad \text{(greedy step)}$$

$$v_{k+1} = (T_{\pi_{k+1,\ell}})^{m}\, T_{\pi_{k+1}} v_k + \epsilon_k \qquad \text{(evaluation step)}$$

While the greedy step is identical to the one of the standard MPI algorithm, the evaluation step involves two new objects that we describe now. $\pi_{k+1,\ell}$ denotes the periodic non-stationary policy that loops in reverse order on the last $\ell$ generated policies:

$$\pi_{k+1,\ell} = \underbrace{\pi_{k+1}\ \pi_k\ \cdots\ \pi_{k-\ell+2}}_{\text{last } \ell \text{ policies}}\ \underbrace{\pi_{k+1}\ \pi_k\ \cdots\ \pi_{k-\ell+2}}_{\text{last } \ell \text{ policies}}\ \cdots$$

Following the policy $\pi_{k+1,\ell}$ means that the first action is selected by $\pi_{k+1}$, the second one by $\pi_k$, and so on until the $\ell$-th one by $\pi_{k-\ell+2}$; then the policy loops and the next actions are selected by $\pi_{k+1}$, $\pi_k$, and so forth. In the above algorithm, $T_{\pi_{k+1,\ell}}$ is the linear Bellman operator associated to $\pi_{k+1,\ell}$:

$$T_{\pi_{k+1,\ell}} = T_{\pi_{k+1}} T_{\pi_k} \cdots T_{\pi_{k-\ell+2}},$$

that is the operator of which the unique fixed point is the value function of $\pi_{k+1,\ell}$. After $k$ iterations, the output of the algorithm is the periodic non-stationary policy $\pi_{k,\ell}$.

For the values $m = 0$ and $m = \infty$, one respectively recovers the variations of VI⁴ and PI recently proposed by Scherrer and Lesner (2012). When $\ell = 1$, one recovers the standard MPI algorithm by Puterman and Shin (1978) (that itself generalizes the standard VI and PI algorithms). As it generalizes all previously proposed algorithms, we will simply refer to this new algorithm as MPI with parameters $m$ and $\ell$.
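The following Python sketch illustrates how the evaluation step composes the Bellman operators of the last $\ell$ policies. It is an exact (zero-error) version written under assumptions of ours: a finite MDP stored as `P[a][x][y]` and `r[a][x]`, hypothetical function names, and a toy two-state example. It is meant as a reading aid, not as the authors' implementation.

```python
def q_value(P, r, v, gamma, x, a):
    """Expected one-step return of action a in state x, bootstrapping on v."""
    return r[a][x] + gamma * sum(p * v[y] for y, p in enumerate(P[a][x]))


def greedy(P, r, v, gamma):
    """One policy among those greedy w.r.t. v (ties -> lowest action index)."""
    return [max(range(len(P)), key=lambda a: q_value(P, r, v, gamma, x, a))
            for x in range(len(v))]


def apply_T_pi(P, r, policy, v, gamma):
    """Apply the Bellman operator T_pi once."""
    return [q_value(P, r, v, gamma, x, policy[x]) for x in range(len(v))]


def apply_T_periodic(P, r, window, v, gamma):
    """Apply T_{pi_k} T_{pi_{k-1}} ... T_{pi_{k-l+1}} to v.

    window lists the last l policies, most recent first; the rightmost
    operator acts first, hence the reversed traversal.
    """
    for policy in reversed(window):
        v = apply_T_pi(P, r, policy, v, gamma)
    return v


def nonstationary_mpi(P, r, gamma, m, ell, n_iter, v0, initial_policy):
    """Zero-error sketch of MPI with parameters m and ell."""
    v = list(v0)
    window = [list(initial_policy) for _ in range(ell - 1)]  # initial policies
    for _ in range(n_iter):
        new_policy = greedy(P, r, v, gamma)           # greedy step
        window = ([new_policy] + window)[:ell]        # keep the last ell policies
        v = apply_T_pi(P, r, new_policy, v, gamma)    # T_{pi_{k+1}} v_k
        for _ in range(m):                            # then (T_{pi_{k+1,ell}})^m
            v = apply_T_periodic(P, r, window, v, gamma)
    return window, v


# Hypothetical two-state example: action 0 stays, action 1 switches state;
# staying in state 1 pays 1, every other transition pays 0.
P = [[[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]]]
r = [[0.0, 1.0], [0.0, 0.0]]
```

The returned `window` holds the $\ell$ policies defining the periodic output policy, most recent first. With `ell = 1` the window contains a single policy and the loop reduces to the standard MPI update $(T_{\pi_{k+1}})^{m+1} v_k$.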



² Though the MDP instance used to show the tightness of the bound for VI is the same as that for PI (Bertsekas and Tsitsiklis, 1996, Example 6.4), Scherrer and Lesner (2012) seem to be the first to argue about it in the literature.

³ Theorem 3 with $\ell = 1$.

⁴ As already noted by Scherrer and Lesner (2012), the only difference between this variation of VI and the standard VI algorithm is what is output by the algorithm. Both algorithms use the very same evaluation step: $v_{k+1} = T_{\pi_{k+1}} v_k$. However, after $k$ iterations, while standard VI returns the last stationary policy $\pi_k$, the variation of VI returns the non-stationary policy $\pi_{k,\ell}$.



On the one hand, using this new algorithm may require more memory since one must store $\ell$ policies instead of one. On the other hand, our first main result, proved in Section 4, shows that this extra memory allows us to improve the performance guarantee.

Theorem 2. Consider MPI with any parameters $m \ge 0$ and $\ell \ge 1$. Assume there exists an $\epsilon > 0$ such that the errors satisfy $\|\epsilon_k\|_\infty \le \epsilon$ for all $k$. Then, the loss due to running policy $\pi_{k,\ell}$ instead of the optimal policy $\pi_*$ satisfies

$$\|v_* - v_{\pi_{k,\ell}}\|_\infty \le \frac{2(\gamma-\gamma^k)}{(1-\gamma)(1-\gamma^\ell)}\,\epsilon + \frac{2\gamma^k}{1-\gamma}\,\|v_* - v_0\|_\infty.$$
As already observed for the standard MPI algorithm, this performance bound is independent of $m$. For any $\ell \ge 1$, it is a factor $\frac{1-\gamma^\ell}{1-\gamma}$ better than in Theorem 1. Using $\ell = \left\lceil \frac{1}{1-\gamma} \right\rceil$ yields⁵ a performance bound of

$$\|v_* - v_{\pi_{k,\ell}}\|_\infty < \frac{3.164\,(\gamma-\gamma^k)}{1-\gamma}\,\epsilon + \frac{2\gamma^k}{1-\gamma}\,\|v_* - v_0\|_\infty,$$

and constitutes asymptotically an improvement of order $O(1-\gamma)$, which is significant when $\gamma$ is close to 1. In fact, Theorem 2 is a generalization of Theorem 1 for $\ell \ge 1$ (the bounds match when $\ell = 1$). While this result was obtained through two independent proofs for the variations of VI and PI proposed by Scherrer and Lesner (2012), the more general setting that we consider here involves a unified proof that extends that provided for the standard MPI ($\ell = 1$) by Scherrer et al. (2012b). Moreover, our result is much more general since it applies to all the variations of MPI for any $\ell$ and $m$.
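To make the size of the improvement concrete, the short Python sketch below evaluates the coefficient of $\epsilon$ in both bounds and checks the constant 3.164 numerically for a few discount factors. The function names are ours and only serve this illustration.

```python
import math


def theorem1_factor(gamma, k):
    """Coefficient of epsilon in the bound of Theorem 1."""
    return 2.0 * (gamma - gamma ** k) / (1.0 - gamma) ** 2


def theorem2_factor(gamma, k, ell):
    """Coefficient of epsilon in the bound of Theorem 2."""
    return 2.0 * (gamma - gamma ** k) / ((1.0 - gamma) * (1.0 - gamma ** ell))


def suggested_period(gamma):
    """Period ell = ceil(1 / (1 - gamma)) discussed in the text."""
    return math.ceil(1.0 / (1.0 - gamma))


for gamma in (0.5, 0.9, 0.99, 0.999):
    ell = suggested_period(gamma)
    ratio = theorem1_factor(gamma, 50) / theorem2_factor(gamma, 50, ell)
    print(gamma, ell, 2.0 / (1.0 - gamma ** ell), ratio)
```

The third printed column is $2/(1-\gamma^\ell)$, which should stay below 3.164, and the last column is the improvement factor $(1-\gamma^\ell)/(1-\gamma)$, which grows like $1/(1-\gamma)$ as $\gamma$ approaches 1.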




Figure 1: The deterministic MDP matching the bound of Theorem 2. 

The second main result of this article, proved in Section 5, is that the bound of Theorem 2 is tight, in the precise sense formalized by the following theorem.

Theorem 3. For all parameter values $m \ge 0$ and $\ell \ge 1$, for all $\epsilon > 0$, there exists an MDP instance, an initial value function $v_0$, a set of initial policies $\pi_0, \pi_{-1}, \dots, \pi_{-\ell+2}$ and a sequence of error terms $(\epsilon_k)_{k \ge 1}$ satisfying $\|\epsilon_k\|_\infty \le \epsilon$, such that for all iterations $k$, the bound of Theorem 2 is satisfied with equality.

This theorem generalizes the (separate) tightness results for PI (Bertsekas and Tsitsiklis, 1996) and for VI (Scherrer and Lesner, 2012) where the problem constructed to attain the bound is a specialization of the one we use in Section 5. To our knowledge, this result is new even for the standard MPI algorithm ($m$ arbitrary but $\ell = 1$), and for the non-trivial non-stationary variations of VI ($m = 0$, $\ell > 1$) and PI ($m = \infty$, $\ell > 1$). The proof considers a generalization of the MDP instance used to prove the tightness of the bound for VI (Scherrer and Lesner, 2012) and PI (Bertsekas and Tsitsiklis, 1996, Example 6.4). Precisely, this MDP consists of states $\{1, 2, \dots\}$ and two actions, left ($\leftarrow$) and right ($\rightarrow$); the reward function $r$ and transition kernel $P$ are characterized as follows for any state $i \ge 2$:



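$$r(i,\leftarrow) = 0, \qquad r(i,\rightarrow) = -2\,\frac{\gamma-\gamma^i}{1-\gamma}\,\epsilon,$$

$$P(i \mid i+1,\leftarrow) = 1, \qquad P(i+\ell-1 \mid i,\rightarrow) = 1,$$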



and $r(1) = 0$ and $P(1|1) = 1$ for state 1 (all the other transitions having zero probability mass). As a shortcut, we will use the notation $r_i$ for the non-zero reward $r(i,\rightarrow)$ in state $i$. Figure 1 depicts the general structure of this MDP. It is easily seen that the optimal policy $\pi_*$ is to take $\leftarrow$ in all states $i \ge 2$, as doing otherwise would incur a negative reward. Therefore, the optimal value $v_*(i)$ is 0 in all states $i$. The proof of the above theorem considers that we run MPI with $v_0 = v_* = 0$, $\pi_0 = \pi_{-1} = \cdots = \pi_{-\ell+2} = \pi_*$, and the following sequence of error terms:

$$\forall i,\quad \epsilon_k(i) = \begin{cases} -\epsilon & \text{if } i = k,\\ \epsilon & \text{if } i = k+\ell,\\ 0 & \text{otherwise.} \end{cases}$$

⁵ Using the facts that $1-\gamma \le -\log\gamma$ and $\log\gamma \le 0$, we have $\log\gamma^\ell \le \log\gamma^{\frac{1}{1-\gamma}} \le \frac{1}{-\log\gamma}\log\gamma = -1$, hence $\gamma^\ell \le e^{-1}$. Therefore $\frac{2}{1-\gamma^\ell} \le \frac{2}{1-e^{-1}} < 3.164$.

In such a case, one can prove that the sequence of policies $\pi_1, \pi_2, \dots, \pi_k$ that are generated up to iteration $k$ is such that for all $i \le k$, the policy $\pi_i$ takes $\leftarrow$ in all states but $i$, where it takes $\rightarrow$. As a consequence, a non-stationary policy $\pi_{k,\ell}$ built from this sequence takes $\rightarrow$ in $k$ (as dictated by $\pi_k$), which transfers the system into state $k+\ell-1$, incurring a reward of $r_k$. Then the policies $\pi_{k-1}, \pi_{k-2}, \dots, \pi_{k-\ell+1}$ are followed, each indicating to take $\leftarrow$ with 0 reward. After $\ell$ steps, the system is again in state $k$ and, by the periodicity of the policy, must again use the action $\pi_k(k) = \rightarrow$. The system is thus stuck in a loop, where every $\ell$ steps a negative reward of $r_k$ is received. Consequently, the value of this policy from state $k$ is:

$$v_{\pi_{k,\ell}}(k) = r_k + \gamma^\ell\left(r_k + \gamma^\ell\left(r_k + \cdots\right)\right) = \frac{r_k}{1-\gamma^\ell} = -\frac{2(\gamma-\gamma^k)}{(1-\gamma)(1-\gamma^\ell)}\,\epsilon.$$

As a consequence, we get the following lower bound,

$$\|v_* - v_{\pi_{k,\ell}}\|_\infty \ge |v_{\pi_{k,\ell}}(k)| = \frac{2(\gamma-\gamma^k)}{(1-\gamma)(1-\gamma^\ell)}\,\epsilon,$$

which exactly matches the upper bound of Theorem 2 (since $v_0 = v_* = 0$). The proof of this result involves computing the values $v_k(i)$ for all states $i$, steps $k$ of the algorithm, and values $m$ and $\ell$ of the parameters, and proving that the policies $\pi_{k+1}$ that are greedy with respect to these values satisfy what we have described above. Because of the cyclic nature of the MDP, the shape of the value function is quite complex (see for instance Figures 2 and 3 in Section 5) and the exact derivation is tedious. For clarity, this proof is deferred to Section 5.
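The loop described above can be checked numerically. The Python sketch below simulates the deterministic MDP of this section under the periodic policy built from $\pi_k, \pi_{k-1}, \dots, \pi_{k-\ell+1}$, where each $\pi_j$ takes the right action only in state $j$, and compares the discounted return from state $k$ with the closed form. Function names, the truncation after `n_periods` periods and the parameter values are assumptions of this illustration.

```python
def reward_right(i, gamma, eps):
    """Reward r_i of the right action in state i >= 2."""
    return -2.0 * (gamma - gamma ** i) / (1.0 - gamma) * eps


def step(state, action, ell, gamma, eps):
    """Deterministic transition of the MDP: returns (next_state, reward)."""
    if state == 1:                      # state 1 is absorbing with zero reward
        return 1, 0.0
    if action == "right":
        return state + ell - 1, reward_right(state, gamma, eps)
    return state - 1, 0.0               # left action


def periodic_policy_value(k, ell, gamma, eps, n_periods=2000):
    """Truncated discounted return of the periodic policy from state k."""
    state, value, discount = k, 0.0, 1.0
    for _ in range(n_periods):
        for j in range(ell):            # policies pi_k, pi_{k-1}, ..., pi_{k-ell+1}
            # pi_{k-j} takes 'right' only in state k-j, 'left' elsewhere
            action = "right" if (state == k - j and state >= 2) else "left"
            state, reward = step(state, action, ell, gamma, eps)
            value += discount * reward
            discount *= gamma
    return value


def closed_form(k, ell, gamma, eps):
    """Value r_k / (1 - gamma^ell) derived in the text."""
    return -2.0 * (gamma - gamma ** k) / ((1.0 - gamma) * (1.0 - gamma ** ell)) * eps
```

For instance, `periodic_policy_value(5, 3, 0.9, 0.1)` and `closed_form(5, 3, 0.9, 0.1)` should agree up to floating-point error, since the neglected tail of the return is negligible after 2000 periods.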

3 Discussion 

Since it is well known that there exists an optimal policy that is stationary, our result (as well as those of Scherrer and Lesner (2012)) suggesting to consider non-stationary policies may appear strange. There exists, however, a very simple approximation scheme of discounted infinite-horizon control problems, which to our knowledge has never been documented in the literature, that sheds some light on the deep reason why non-stationary policies may be an interesting option. Given an infinite-horizon problem, consider approximating it by a finite-horizon discounted control problem by "cutting the horizon" after some sufficiently big instant $T$ (that is, assume there is no reward after time $T$). Contrary to the original infinite-horizon problem, the resulting finite-horizon problem is non-stationary, and has therefore naturally a non-stationary solution that is built by dynamic programming in reverse order. Moreover, it can be shown (Kakade, 2003, by adapting the proof of Theorem 2.5.1) that solving this finite-horizon problem with VI with a potential error of $\epsilon$ at each iteration will induce at most a performance error of $2\sum_{i=0}^{T-1}\gamma^i\epsilon = \frac{2(1-\gamma^T)}{1-\gamma}\,\epsilon$. If we add the error due to truncating the horizon ($\gamma^T\,\frac{\max_{s,a}|r(s,a)|}{1-\gamma}$), we get an overall error of order $O\left(\frac{1}{1-\gamma}\,\epsilon\right)$ for a memory $T$ of the order of⁶ $\tilde{O}\left(\frac{1}{1-\gamma}\right)$. Though this approximation scheme may require a significant amount of memory (when $\gamma$ is close to 1), it achieves the same $O(1-\gamma)$ improvement over the standard MPI performance bound as the new scheme obtained through our generalization of MPI with two parameters $m$ and $\ell$. In comparison, the new proposed algorithm can be seen as a more flexible way to make the trade-off between the memory and the quality.
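The trade-off between the horizon $T$ and the two error terms of this truncation scheme can be explored with a few lines of Python. The sketch below is only an illustration under our own assumptions: `r_max` stands for $\max_{s,a}|r(s,a)|$, the truncation term is taken as $\gamma^T r_{\max}/(1-\gamma)$, and the stopping rule (truncation error at most $\epsilon/(1-\gamma)$) is one possible way to pick $T$.

```python
def approximation_error(gamma, eps, T):
    """Error accumulated by T approximate VI steps: 2 * sum_{i<T} gamma^i * eps."""
    return 2.0 * sum(gamma ** i for i in range(T)) * eps


def truncation_error(gamma, T, r_max):
    """Discounted reward mass ignored after time T: gamma^T * r_max / (1 - gamma)."""
    return gamma ** T * r_max / (1.0 - gamma)


def smallest_horizon(gamma, eps, r_max):
    """Smallest T whose truncation error is at most eps / (1 - gamma)."""
    T = 0
    while truncation_error(gamma, T, r_max) > eps / (1.0 - gamma):
        T += 1
    return T


gamma, eps, r_max = 0.99, 0.01, 1.0
T = smallest_horizon(gamma, eps, r_max)
total = approximation_error(gamma, eps, T) + truncation_error(gamma, T, r_max)
print(T, total, 3.0 * eps / (1.0 - gamma))
```

With this stopping rule the total error is at most $3\epsilon/(1-\gamma)$, and the horizon grows like $\log(r_{\max}/\epsilon)/(1-\gamma)$ when $\gamma$ is close to 1, which is the memory cost discussed above.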

A practical limitation of Theorem 2 is that it assumes that the errors $\epsilon_k$ are controlled in max norm. In practice, the evaluation step of dynamic programming algorithms is usually done through some regression scheme (see for instance Bertsekas and Tsitsiklis, 1996; Antos et al., 2007a,b; Scherrer et al., 2012a) and thus controlled through some weighted quadratic $L_{2,\mu}$ norm, defined as $\|f\|_{2,\mu} = \sqrt{\int f(x)^2\,\mu(dx)}$. Munos (2003, 2007) originally developed such analyses for VI and PI. Farahmand et al. (2010) and Scherrer et al. (2012a) later improved them. Using a technical lemma due to Scherrer et al. (2012a, Lemma 3), one can easily deduce⁷ from our analysis (developed in Section 4) the following performance bound.

⁶ We use the fact that $\gamma^T K < \frac{\epsilon}{1-\gamma} \Leftrightarrow T > \frac{\log\frac{(1-\gamma)K}{\epsilon}}{\log\frac{1}{\gamma}} \simeq \frac{\log\frac{(1-\gamma)K}{\epsilon}}{1-\gamma}$ with $K = \frac{\max_{s,a}|r(s,a)|}{1-\gamma}$.

Corollary 1. Consider MPI with any parameters $m \ge 0$ and $\ell \ge 1$. Assume there exists an $\epsilon > 0$ such that the errors satisfy $\|\epsilon_k\|_{2,\mu} \le \epsilon$ for all $k$. Then, the expected (with respect to some initial measure $\rho$) loss due to running policy $\pi_{k,\ell}$ instead of the optimal policy $\pi_*$ satisfies

$$\mathbb{E}\left[v_*(s) - v_{\pi_{k,\ell}}(s) \mid s \sim \rho\right] \le \frac{2(\gamma-\gamma^k)\,C_{1,k,\ell}}{(1-\gamma)(1-\gamma^\ell)}\,\epsilon + \frac{2\gamma^k\,C_{k,k+1,1}}{1-\gamma}\,\|v_* - v_0\|_{2,\mu},$$

where each coefficient $C_{j,k,l}$ is a convex combination of concentrability coefficients based on Radon-Nikodym derivatives

$$c(j) = \max_{\pi_1,\dots,\pi_j}\left\|\frac{d(\rho P_{\pi_1}P_{\pi_2}\cdots P_{\pi_j})}{d\mu}\right\|_{2,\mu}.$$

With respect to the previous bound in max norm, this bound involves extra constants $C_{j,k,l} \ge 1$. Each such coefficient $C_{j,k,l}$ is a convex combination of terms $c(i)$, each of which quantifies the difference between 1) the distribution $\mu$ used to control the errors and 2) the distribution obtained by starting from $\rho$ and making $i$ steps with arbitrary sequences of policies. Overall, this extra constant can be seen as a measure of stochastic smoothness of the MDP (the smoother, the smaller). Further details on these coefficients can be found in (Munos, 2003, 2007; Farahmand et al., 2010).

The next two sections contain the proofs of our two main results, namely Theorems 2 and 3.

4 Proof of Theorem 2 

Throughout this proof we will write $P_k$ (resp. $P_*$) for the transition kernel $P_{\pi_k}$ (resp. $P_{\pi_*}$) induced by the stationary policy $\pi_k$ (resp. $\pi_*$). We will write $T_k$ (resp. $T_*$) for the associated Bellman operator. Similarly, we will write $P_{k,\ell}$ for the transition kernel associated with the non-stationary policy $\pi_{k,\ell}$ and $T_{k,\ell}$ for its associated Bellman operator.

For $k \ge 0$ we define the following quantities:

• $b_k = T_{k+1}v_k - T_{k+1,\ell}T_{k+1}v_k$. This quantity, which we will call the residual, may be viewed as a non-stationary analogue of the Bellman residual $v_k - T_{k+1}v_k$.

• $s_k = v_k - v_{\pi_{k,\ell}} - \epsilon_k$. We will call it shift, as it measures the shift between the value $v_{\pi_{k,\ell}}$ and the estimate $v_k$ before incurring the error.

• $d_k = v_* - v_k + \epsilon_k$. This quantity, called distance thereafter, provides the distance between the $k$-th value function (before the error is added) and the optimal value function.

• $l_k = v_* - v_{\pi_{k,\ell}}$. This is the loss of the policy $\pi_{k,\ell}$. The loss is always non-negative since no policy can have a value greater than $v_*$.

The proof is outlined as follows. We first provide a bound on $b_k$ which will be used to express both the bounds on $s_k$ and $d_k$. Then, observing that $l_k = s_k + d_k$ will allow us to express the bound of $\|l_k\|_\infty$ stated by Theorem 2. Our arguments extend those made by Scherrer et al. (2012a) in the specific case $\ell = 1$. We will repeatedly use the fact that since policy $\pi_{k+1}$ is greedy with respect to $v_k$, we have

$$\forall \pi',\quad T_{k+1}v_k \ge T_{\pi'}v_k. \qquad (2)$$



⁷ Precisely, Lemma 3 of (Scherrer et al., 2012a) should be applied to Equation (5) in Section 4.



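For a non-stationary policy $\pi_{k,\ell}$, the induced $\ell$-step transition kernel is

$$P_{k,\ell} = P_k P_{k-1} \cdots P_{k-\ell+1}.$$

As a consequence, for any function $f : X \to \mathbb{R}$, the operator $T_{k,\ell}$ may be expressed as:

$$T_{k,\ell}f = r_k + \gamma P_{k,1}r_{k-1} + \gamma^2 P_{k,2}r_{k-2} + \cdots + \gamma^{\ell-1}P_{k,\ell-1}r_{k-\ell+1} + \gamma^\ell P_{k,\ell}f,$$

then, for any function $g : X \to \mathbb{R}$, we have

$$T_{k,\ell}f - T_{k,\ell}g = \gamma^\ell P_{k,\ell}(f - g) \qquad (3)$$

and

$$T_{k,\ell}(f + g) = T_{k,\ell}f + \gamma^\ell P_{k,\ell}(g). \qquad (4)$$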

The following notation will be useful. 

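Definition 1 (Scherrer et al. (2012a)). For a positive integer $n$, we define $\mathbb{P}_n$ as the set of discounted transition kernels that are defined as follows:

1. for any set of $n$ policies $\{\pi_1, \dots, \pi_n\}$, $(\gamma P_{\pi_1})(\gamma P_{\pi_2})\cdots(\gamma P_{\pi_n}) \in \mathbb{P}_n$,

2. for any $\alpha \in (0,1)$ and $P_1, P_2 \in \mathbb{P}_n$, $\alpha P_1 + (1-\alpha)P_2 \in \mathbb{P}_n$.

With some abuse of notation, we write $\Gamma^n$ for denoting any element of $\mathbb{P}_n$.

Example 1 ($\Gamma^n$ notation). If we write a transition kernel $P$ as $P = \alpha_1\Gamma^i + \alpha_2\Gamma^j\Gamma^k = \alpha_1\Gamma^i + \alpha_2\Gamma^{j+k}$, it should be read as: "There exist $P_1 \in \mathbb{P}_i$, $P_2 \in \mathbb{P}_j$, $P_3 \in \mathbb{P}_k$ and $P_4 \in \mathbb{P}_{j+k}$ such that $P = \alpha_1P_1 + \alpha_2P_2P_3 = \alpha_1P_1 + \alpha_2P_4$."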

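We first provide three lemmas bounding the residual, the shift and the distance, respectively.

Lemma 1 (residual bound). The residual $b_k$ satisfies the following bound:

$$b_k \le \sum_{i=1}^{k}\Gamma^{(\ell m+1)(k-i)}x_i + \Gamma^{(\ell m+1)k}b_0,$$

where

$$x_k = (I - \Gamma^\ell)\Gamma\epsilon_k.$$

Proof. We have:

$$\begin{aligned}
b_k &= T_{k+1}v_k - T_{k+1,\ell}T_{k+1}v_k \\
&\le T_{k+1}v_k - T_{k+1,\ell}T_{k-\ell+1}v_k && \{T_{k+1}v_k \ge T_{k-\ell+1}v_k\ (2)\} \\
&= T_{k+1}v_k - T_{k+1}T_{k,\ell}v_k \\
&= \gamma P_{k+1}\left(v_k - T_{k,\ell}v_k\right) \\
&= \gamma P_{k+1}\left((T_{k,\ell})^mT_kv_{k-1} + \epsilon_k - T_{k,\ell}\left((T_{k,\ell})^mT_kv_{k-1} + \epsilon_k\right)\right) \\
&= \gamma P_{k+1}\left((T_{k,\ell})^mT_kv_{k-1} - (T_{k,\ell})^{m+1}T_kv_{k-1} + (I - \gamma^\ell P_{k,\ell})\epsilon_k\right) && \{(4)\} \\
&= \gamma P_{k+1}\left((\gamma^\ell P_{k,\ell})^m\left(T_kv_{k-1} - T_{k,\ell}T_kv_{k-1}\right) + (I - \gamma^\ell P_{k,\ell})\epsilon_k\right) && \{(3)\} \\
&= \gamma P_{k+1}\left((\gamma^\ell P_{k,\ell})^m b_{k-1} + (I - \gamma^\ell P_{k,\ell})\epsilon_k\right),
\end{aligned}$$

which can be written as

$$b_k \le \Gamma\left(\Gamma^{\ell m}b_{k-1} + (I - \Gamma^\ell)\epsilon_k\right) = \Gamma^{\ell m+1}b_{k-1} + x_k.$$

Then, by induction:

$$b_k \le \sum_{i=0}^{k-1}\Gamma^{(\ell m+1)i}x_{k-i} + \Gamma^{(\ell m+1)k}b_0 = \sum_{i=1}^{k}\Gamma^{(\ell m+1)(k-i)}x_i + \Gamma^{(\ell m+1)k}b_0.$$

□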

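Lemma 2 (distance bound). The distance $d_k$ satisfies the following bound:

$$d_k \le \sum_{i=1}^{k}\sum_{j=0}^{mi-1}\Gamma^{\ell j+i-1}x_{k-i} + \sum_{i=1}^{k}\Gamma^{i-1}y_{k-i} + z_k,$$

where

$$y_k = -\Gamma\epsilon_k$$

and

$$z_k = \sum_{i=0}^{mk-1}\Gamma^{k-1+\ell i}b_0 + \Gamma^k d_0.$$

Proof. First expand $d_k$:

$$\begin{aligned}
d_k &= v_* - v_k + \epsilon_k \\
&= v_* - (T_{k,\ell})^mT_kv_{k-1} \\
&= v_* - T_kv_{k-1} + T_kv_{k-1} - T_{k,\ell}T_kv_{k-1} + T_{k,\ell}T_kv_{k-1} - (T_{k,\ell})^2T_kv_{k-1} \\
&\quad + \cdots + (T_{k,\ell})^{m-1}T_kv_{k-1} - (T_{k,\ell})^mT_kv_{k-1} \\
&= v_* - T_kv_{k-1} + \sum_{i=0}^{m-1}\left((T_{k,\ell})^iT_kv_{k-1} - (T_{k,\ell})^{i+1}T_kv_{k-1}\right) \\
&= T_*v_* - T_kv_{k-1} + \sum_{i=0}^{m-1}(\gamma^\ell P_{k,\ell})^i\left(T_kv_{k-1} - T_{k,\ell}T_kv_{k-1}\right) && \{(3)\} \\
&\le T_*v_* - T_*v_{k-1} + \sum_{i=0}^{m-1}(\gamma^\ell P_{k,\ell})^i b_{k-1} && \{T_kv_{k-1} \ge T_*v_{k-1}\ (2)\} \\
&= \gamma P_*(v_* - v_{k-1}) + \sum_{i=0}^{m-1}(\gamma^\ell P_{k,\ell})^i b_{k-1} && \{(3)\} \\
&= \gamma P_*d_{k-1} - \gamma P_*\epsilon_{k-1} + \sum_{i=0}^{m-1}(\gamma^\ell P_{k,\ell})^i b_{k-1} && \{d_k = v_* - v_k + \epsilon_k\} \\
&= \Gamma d_{k-1} + y_{k-1} + \sum_{i=0}^{m-1}\Gamma^{\ell i}b_{k-1}.
\end{aligned}$$

Then, by induction

$$d_k \le \sum_{j=0}^{k-1}\Gamma^{k-1-j}\left(y_j + \sum_{p=0}^{m-1}\Gamma^{\ell p}b_j\right) + \Gamma^k d_0.$$

Using the bound on $b_k$ from Lemma 1 we get:

$$\begin{aligned}
d_k &\le \sum_{j=0}^{k-1}\Gamma^{k-1-j}\left(y_j + \sum_{p=0}^{m-1}\Gamma^{\ell p}\left(\sum_{i=1}^{j}\Gamma^{(\ell m+1)(j-i)}x_i + \Gamma^{(\ell m+1)j}b_0\right)\right) + \Gamma^k d_0 \\
&= \sum_{j=0}^{k-1}\sum_{p=0}^{m-1}\sum_{i=1}^{j}\Gamma^{k-1-j+\ell p+(\ell m+1)(j-i)}x_i + \sum_{j=0}^{k-1}\sum_{p=0}^{m-1}\Gamma^{k-1-j+\ell p+(\ell m+1)j}b_0 + \Gamma^k d_0 + \sum_{i=1}^{k}\Gamma^{i-1}y_{k-i}.
\end{aligned}$$

First we have:

$$\begin{aligned}
\sum_{j=0}^{k-1}\sum_{p=0}^{m-1}\sum_{i=1}^{j}\Gamma^{k-1-j+\ell p+(\ell m+1)(j-i)}x_i
&= \sum_{i=1}^{k-1}\sum_{j=i}^{k-1}\sum_{p=0}^{m-1}\Gamma^{k-1+\ell(p+mj)-i(\ell m+1)}x_i \\
&= \sum_{i=1}^{k-1}\sum_{j=0}^{m(k-i)-1}\Gamma^{k-1+\ell(j+mi)-i(\ell m+1)}x_i \\
&= \sum_{i=1}^{k-1}\sum_{j=0}^{m(k-i)-1}\Gamma^{\ell j+k-i-1}x_i \\
&= \sum_{i=1}^{k-1}\sum_{j=0}^{mi-1}\Gamma^{\ell j+i-1}x_{k-i}.
\end{aligned}$$

Second we have:

$$\sum_{j=0}^{k-1}\sum_{p=0}^{m-1}\Gamma^{k-1-j+\ell p+(\ell m+1)j}b_0 = \sum_{j=0}^{k-1}\sum_{p=0}^{m-1}\Gamma^{k-1+\ell(p+mj)}b_0 = \sum_{i=0}^{mk-1}\Gamma^{k-1+\ell i}b_0 = z_k - \Gamma^k d_0.$$

Hence (the first sum may be extended to $i = k$ because $x_0 = 0$ when $\epsilon_0 = 0$)

$$d_k \le \sum_{i=1}^{k}\sum_{j=0}^{mi-1}\Gamma^{\ell j+i-1}x_{k-i} + \sum_{i=1}^{k}\Gamma^{i-1}y_{k-i} + z_k.$$

□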



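Lemma 3 (shift bound). The shift $s_k$ is bounded by:

$$s_k \le \sum_{i=1}^{k-1}\sum_{j=mi}^{\infty}\Gamma^{\ell j+i-1}x_{k-i} + z'_k,$$

where

$$z'_k = \sum_{j=mk}^{\infty}\Gamma^{\ell j+k-1}b_0.$$

Proof. Expanding $s_k$ we obtain:

$$\begin{aligned}
s_k &= v_k - v_{\pi_{k,\ell}} - \epsilon_k \\
&= (T_{k,\ell})^mT_kv_{k-1} - v_{\pi_{k,\ell}} \\
&= (T_{k,\ell})^mT_kv_{k-1} - (T_{k,\ell})^\infty T_{k,\ell}T_kv_{k-1} && \{\forall f:\ v_{\pi_{k,\ell}} = (T_{k,\ell})^\infty f\} \\
&= (\gamma^\ell P_{k,\ell})^m\sum_{j=0}^{\infty}(\gamma^\ell P_{k,\ell})^j\left(T_kv_{k-1} - T_{k,\ell}T_kv_{k-1}\right) \\
&= \Gamma^{\ell m}\sum_{j=0}^{\infty}\Gamma^{\ell j}b_{k-1}.
\end{aligned}$$

Plugging the bound on $b_k$ of Lemma 1 we get:

$$\begin{aligned}
s_k &\le \Gamma^{\ell m}\sum_{j=0}^{\infty}\Gamma^{\ell j}\left(\sum_{i=1}^{k-1}\Gamma^{(\ell m+1)(k-1-i)}x_i + \Gamma^{(\ell m+1)(k-1)}b_0\right) \\
&= \sum_{j=0}^{\infty}\sum_{i=1}^{k-1}\Gamma^{\ell m+\ell j+(\ell m+1)(k-i-1)}x_i + \sum_{j=0}^{\infty}\Gamma^{\ell m+\ell j+(\ell m+1)(k-1)}b_0 \\
&= \sum_{i=1}^{k-1}\sum_{j=0}^{\infty}\Gamma^{\ell(j+mi)+i-1}x_{k-i} + \sum_{j=0}^{\infty}\Gamma^{\ell(j+mk)+k-1}b_0 \\
&= \sum_{i=1}^{k-1}\sum_{j=mi}^{\infty}\Gamma^{\ell j+i-1}x_{k-i} + z'_k.
\end{aligned}$$

□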



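Lemma 4 (loss bound). The loss $l_k$ is bounded by:

$$l_k \le \sum_{i=1}^{k-1}\Gamma^i\left(\sum_{j=0}^{\infty}\Gamma^{\ell j}(I - \Gamma^\ell) - I\right)\epsilon_{k-i} + \eta_k,$$

where

$$\eta_k = z_k + z'_k = \sum_{i=0}^{mk-1}\Gamma^{\ell i+k-1}b_0 + \Gamma^k d_0 + \sum_{j=mk}^{\infty}\Gamma^{\ell j+k-1}b_0 = \sum_{i=0}^{\infty}\Gamma^{\ell i+k-1}b_0 + \Gamma^k d_0.$$

Proof. Using Lemmas 2 and 3, we have:

$$\begin{aligned}
l_k &= s_k + d_k \\
&\le \sum_{i=1}^{k-1}\sum_{j=mi}^{\infty}\Gamma^{\ell j+i-1}x_{k-i} + \sum_{i=1}^{k-1}\sum_{j=0}^{mi-1}\Gamma^{\ell j+i-1}x_{k-i} + \sum_{i=1}^{k}\Gamma^{i-1}y_{k-i} + z_k + z'_k \\
&= \sum_{i=1}^{k-1}\sum_{j=0}^{\infty}\Gamma^{\ell j+i-1}x_{k-i} + \sum_{i=1}^{k}\Gamma^{i-1}y_{k-i} + \eta_k.
\end{aligned}$$

Plugging back the values of $x_k$ and $y_k$ and using the fact that $\epsilon_0 = 0$ we obtain:

$$\begin{aligned}
l_k &\le \sum_{i=1}^{k-1}\sum_{j=0}^{\infty}\Gamma^{\ell j+i-1}(I - \Gamma^\ell)\Gamma\epsilon_{k-i} + \sum_{i=1}^{k-1}\Gamma^{i-1}(-\Gamma)\epsilon_{k-i} - \Gamma^k\epsilon_0 + \eta_k \\
&= \sum_{i=1}^{k-1}\left(\sum_{j=0}^{\infty}\Gamma^{\ell j+i}(I - \Gamma^\ell)\epsilon_{k-i} - \Gamma^i\epsilon_{k-i}\right) + \eta_k \\
&= \sum_{i=1}^{k-1}\Gamma^i\left(\sum_{j=0}^{\infty}\Gamma^{\ell j}(I - \Gamma^\ell) - I\right)\epsilon_{k-i} + \eta_k.
\end{aligned}$$

□

We now provide a bound of $\eta_k$ in terms of $d_0$: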



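Lemma 5.

$$\eta_k \le \Gamma^k\left(\sum_{i=0}^{\infty}\Gamma^i(\Gamma - I) + I\right)d_0.$$

Proof. First recall that

$$\eta_k = \sum_{i=0}^{\infty}\Gamma^{\ell i+k-1}b_0 + \Gamma^k d_0.$$

In order to bound $\eta_k$ in terms of $d_0$ only, we express $b_0$ in terms of $d_0$:

$$\begin{aligned}
b_0 &= T_1v_0 - (T_1)^\ell T_1v_0 \\
&= T_1v_0 - (T_1)^2v_0 + (T_1)^2v_0 - (T_1)^3v_0 + \cdots + (T_1)^\ell v_0 - (T_1)^{\ell+1}v_0 \\
&= \sum_{i=1}^{\ell}(\gamma P_1)^i(v_0 - T_1v_0) \\
&= \sum_{i=1}^{\ell}(\gamma P_1)^i(v_0 - v_* + T_*v_* - T_*v_0 + T_*v_0 - T_1v_0) \\
&\le \sum_{i=1}^{\ell}(\gamma P_1)^i(v_0 - v_* + T_*v_* - T_*v_0) && \{T_1v_0 \ge T_*v_0\ (2)\} \\
&= \sum_{i=1}^{\ell}(\gamma P_1)^i(\gamma P_* - I)d_0.
\end{aligned}$$

Consequently, we have:

$$\begin{aligned}
\eta_k &\le \sum_{i=0}^{\infty}\Gamma^{\ell i+k-1}\sum_{j=1}^{\ell}(\gamma P_1)^j(\gamma P_* - I)d_0 + \Gamma^k d_0 \\
&= \sum_{i=0}^{\infty}\sum_{j=0}^{\ell-1}\Gamma^{\ell i+j+k}(\Gamma - I)d_0 + \Gamma^k d_0 \\
&= \Gamma^k\left(\sum_{i=0}^{\infty}\sum_{j=0}^{\ell-1}\Gamma^{\ell i+j}(\Gamma - I) + I\right)d_0 \\
&= \Gamma^k\left(\sum_{i=0}^{\infty}\Gamma^i(\Gamma - I) + I\right)d_0.
\end{aligned}$$

□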

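We now conclude the proof of Theorem 2. Taking the absolute value in Lemma 5 we obtain:

$$|\eta_k| \le \Gamma^k\left(\sum_{i=0}^{\infty}\Gamma^i(\Gamma + I) + I\right)|d_0| = 2\sum_{i=k}^{\infty}\Gamma^i|d_0|.$$

Since $l_k$ is non-negative, from Lemma 4 we have:

$$|l_k| \le \sum_{i=1}^{k-1}\Gamma^i\left(\sum_{j=0}^{\infty}\Gamma^{\ell j}(I + \Gamma^\ell) + I\right)|\epsilon_{k-i}| + 2\sum_{i=k}^{\infty}\Gamma^i|d_0| = 2\sum_{i=1}^{k-1}\sum_{j=0}^{\infty}\Gamma^{\ell j+i}|\epsilon_{k-i}| + 2\sum_{i=k}^{\infty}\Gamma^i|d_0|. \qquad (5)$$

Since $\|v\|_\infty = \max|v|$, $d_0 = v_* - v_0$ and $l_k = v_* - v_{\pi_{k,\ell}}$, we can take the maximum in (5) and conclude that:

$$\|v_* - v_{\pi_{k,\ell}}\|_\infty \le \frac{2(\gamma-\gamma^k)}{(1-\gamma)(1-\gamma^\ell)}\,\epsilon + \frac{2\gamma^k}{1-\gamma}\,\|v_* - v_0\|_\infty.$$

□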
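5 Proof of Theorem 3

We shall prove the following result.

Lemma 6. Consider MPI with parameters $m \ge 0$ and $\ell \ge 1$ applied on the problem of Figure 1, starting from $v_0 = 0$ and all initial policies $\pi_0, \pi_{-1}, \dots, \pi_{-\ell+2}$ equal to $\pi_*$. Assume that at each iteration $k$, the following error terms are applied, for some $\epsilon > 0$:

$$\forall i,\quad \epsilon_k(i) = \begin{cases} -\epsilon & \text{if } i = k,\\ \epsilon & \text{if } i = k+\ell,\\ 0 & \text{otherwise.} \end{cases}$$

Then MPI can⁸ generate a sequence of value-policy pairs that is described below.

For all iterations $k \ge 1$, the policy $\pi_k$ takes the optimal action in all states but $k$, that is

$$\pi_k(i) = \begin{cases} \rightarrow & \text{if } i = k,\\ \leftarrow & \text{otherwise.} \end{cases}$$

For all iterations $k \ge 1$, the value function $v_k$ is defined as follows:

• For all $i < k$:

$$v_k(i) = -\gamma^{(k-1)(\ell m+1)}\epsilon \qquad (7.a)$$

• For all $i$ such that $k \le i \le k + ((k-1)m+1)\ell$:

  - For $i = k + (qm+p+1)\ell$ with $q \ge 0$ and $0 \le p < m$ (i.e. $i = k + n\ell$, $n \ge 1$):

$$v_k(i) = \gamma^{q(\ell m+1)}\left(\frac{\gamma^{\ell(p+1)} - \gamma^{\ell(m+1)}}{1-\gamma^\ell}\,r_{k-q} + \mathbb{1}_{[p=0]}\,\epsilon + \sum_{j=1}^{k-q-1}\gamma^{j(\ell m+1)}\left(\frac{\gamma^\ell - \gamma^{\ell(m+1)}}{1-\gamma^\ell}\,r_{k-q-j} + \epsilon\right)\right) \qquad (7.b)$$

  - For $i = k$:

$$v_k(k) = v_k(k+\ell) + r_k - 2\epsilon \qquad (7.c)$$

  - For $i = k + q\ell + p$ with $0 \le q \le (k-1)m - 1$ and $1 \le p < \ell$:

$$v_k(i) = -\gamma^{(k-1)(\ell m+1)}\epsilon \qquad (7.d)$$

  - Otherwise, i.e. when $i = k + (k-1)m\ell + p$ with $1 \le p < \ell$:

$$v_k(i) = 0 \qquad (7.e)$$

• For all $i > k + ((k-1)m+1)\ell$:

$$v_k(i) = 0 \qquad (7.f)$$

The relative complexity of the different expressions of $v_k$ in Lemma 6 is due to the presence of nested periodic patterns in the shape of the value function along the state space and the horizon. Figures 2 and 3 give the shape of the value function for different values of $\ell$ and $m$, exhibiting the periodic patterns. The proof of Lemma 6 is done by induction on $k$.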

5.1 Base case $k = 1$

Since $v_0 = 0$, $\pi_1$ is the optimal policy that takes $\leftarrow$ in all states as desired. Hence, $(T_{1,\ell})^mT_1v_0 = 0$ in all states. Accounting for the errors $\epsilon_1$ we have $v_1 = (T_{1,\ell})^mT_1v_0 + \epsilon_1 = \epsilon_1$. As can be seen on Figures 2 and 3, when $k = 1$ we only need to consider equations (7.b), (7.c), (7.e) and (7.f) since the others apply to an empty set of states.

First, we have

$$v_1(1+\ell) = \epsilon_1(1+\ell) = \epsilon,$$

which is (7.b) when $q = (k-1) = 0$ and $p = 0$.

Second, we have

$$v_1(1) = \epsilon_1(1) = -\epsilon = \epsilon + 0 - 2\epsilon = v_1(1+\ell) + r_1 - 2\epsilon,$$

which corresponds to (7.c).

Third, for $1 \le p < \ell$ we have

$$v_1(1+p) = \epsilon_1(1+p) = 0,$$

corresponding to (7.e).

Finally, for all the remaining states $i > 1 + \ell$, we have

$$v_1(i) = \epsilon_1(i) = 0,$$

corresponding to (7.f).

The base case is now proved.

⁸ We write here "can" since at each iteration, several policies will be greedy with respect to the current value.

Figure 2: Shape of the value function with $\ell = 2$ and $m = 3$.

Figure 3: Shape of the value function with $\ell = 3$ and $m = 2$.



5.2 Induction Step 

We assume that Lemma 6 holds for some fixed $k \ge 1$; we now show that it also holds for $k + 1$.

5.2.1 The policy $\pi_{k+1}$

We begin by showing that the policy $\pi_{k+1}$ is greedy with respect to $v_k$. Since there is no choice in state 1, we turn our attention to the other states. There are many cases to consider, each one of them corresponding to one or more states. These cases, labelled from A through F, are summarized as follows, depending on the state $i$:

(A) $1 < i < k+1$

(B) $i = k+1$

(C) $i = k+1+q\ell+p$ with $1 \le p < \ell$ and $0 \le q \le (k-1)m$

(D) $i = k+1+(qm+p+1)\ell$ with $0 \le p < m$ and $0 \le q < k-1$

(E) $i = k+1+((k-1)m+1)\ell$

(F) $i > k+1+((k-1)m+1)\ell$

Figure 4 depicts how those cases cover the whole state space.

Figure 4: Policy cases; each state is represented by a letter corresponding to a case of the policy $\pi_{k+1}$. Starting from 1, state numbers increase from left to right.

For all states $i > 1$ in each of the above cases, we consider the action-value functions $q^{\rightarrow}_{k+1}(i)$ (resp. $q^{\leftarrow}_{k+1}(i)$) of action $\rightarrow$ (resp. $\leftarrow$) defined as:

$$q^{\rightarrow}_{k+1}(i) = r_i + \gamma v_k(i+\ell-1) \quad\text{and}\quad q^{\leftarrow}_{k+1}(i) = \gamma v_k(i-1).$$

In case $i = k+1$ (B) we will show that $q^{\rightarrow}_{k+1}(i) = q^{\leftarrow}_{k+1}(i)$, meaning that a policy $\pi_{k+1}$ greedy for $v_k$ may be either $\pi_{k+1}(k+1) = \rightarrow$ or $\pi_{k+1}(k+1) = \leftarrow$. In all other cases we show that $q^{\rightarrow}_{k+1}(i) < q^{\leftarrow}_{k+1}(i)$, which implies that for those $i \ne k+1$, $\pi_{k+1}(i) = \leftarrow$, as required by Lemma 6.
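A: In states $1 < i < k+1$. We have $q^{\rightarrow}_{k+1}(i) = r_i + \gamma v_k(i+\ell-1)$ and $q^{\leftarrow}_{k+1}(i) = \gamma v_k(i-1)$. Depending on the value of $i+\ell-1$, which is the state reached by taking the $\rightarrow$ action, we need to consider two cases:

• Case 1: $i+\ell-1 \ne k$. In this case $v_k(i+\ell-1)$ is described by either (7.a) or (7.d) when $i+\ell-1$ is less than, or greater than $k$, respectively. In either case we have $v_k(i+\ell-1) = -\gamma^{(k-1)(\ell m+1)}\epsilon = v_k(i-1)$ and hence:

$$q^{\rightarrow}_{k+1}(i) = r_i + \gamma v_k(i+\ell-1) = r_i + \gamma v_k(i-1) < \gamma v_k(i-1) = q^{\leftarrow}_{k+1}(i),$$

which gives $\pi_{k+1}(i) = \leftarrow$ as desired.

• Case 2: $i+\ell-1 = k$.

$$\begin{aligned}
q^{\rightarrow}_{k+1}(i) &= r_i + \gamma v_k(k) = r_i + \gamma\left(v_k(k+\ell) + r_k - 2\epsilon\right) && \{(7.c)\} \\
&= r_i + \gamma\left(\sum_{j=0}^{k-1}\gamma^{j(\ell m+1)}\left(\frac{\gamma^\ell - \gamma^{\ell(m+1)}}{1-\gamma^\ell}\,r_{k-j} + \epsilon\right) + r_k - 2\epsilon\right) && \{(7.b)\} \\
&\le r_i + \gamma\left(\sum_{j=0}^{k-1}\gamma^{j(\ell m+1)}\epsilon + r_k - 2\epsilon\right) && \{r_{k-j} \le 0\} \\
&= r_i + \gamma\left(\sum_{j=1}^{k-1}\left(\gamma^{j(\ell m+1)}\epsilon - 2\gamma^j\epsilon\right) - \epsilon\right) && \{r_k = -2\textstyle\sum_{j=1}^{k-1}\gamma^j\epsilon\} \\
&\le r_i - \gamma\epsilon && \{\gamma^{j(\ell m+1)}\epsilon - 2\gamma^j\epsilon \le 0\} \\
&< -\gamma^{(k-1)(\ell m+1)+1}\epsilon && \{r_i < 0\} \\
&= \gamma v_k(i-1) && \{v_k(i-1) = -\gamma^{(k-1)(\ell m+1)}\epsilon\ (7.a)\} \\
&= q^{\leftarrow}_{k+1}(i),
\end{aligned}$$

giving $\pi_{k+1}(i) = \leftarrow$ as desired.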
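B: In state $k+1$. Looking at the action-value function $q^{\leftarrow}_{k+1}$ in state $k+1$, we observe that:

$$\begin{aligned}
q^{\leftarrow}_{k+1}(k+1) &= \gamma v_k(k) = \gamma\left(r_k - 2\epsilon + v_k(k+\ell)\right) && \{(7.c)\} \\
&= \gamma r_k - 2\gamma\epsilon + \gamma v_k(k+\ell) \\
&= r_{k+1} + \gamma v_k(k+\ell) && \{r_{k+1} = \gamma r_k - 2\gamma\epsilon\} \\
&= q^{\rightarrow}_{k+1}(k+1).
\end{aligned}$$

This means that the algorithm can take $\pi_{k+1}(k+1) = \rightarrow$ so as to satisfy Lemma 6.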

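C: In states $i = k+1+q\ell+p$. We restrict ourselves to the cases when $1 \le p < \ell$ and $0 \le q \le (k-1)m$. Three cases for the value of $q$ need to be considered:

• Case 1: $0 \le q < (k-1)m - 1$. We have:

$$\begin{aligned}
q^{\rightarrow}_{k+1}(i) &= r_i + \gamma v_k(k + (q+1)\ell + p) \\
&= r_i + \gamma v_k(k + q\ell + p) && \{(7.d)\text{ independent of } q\} \\
&< \gamma v_k(k + q\ell + p) && \{r_i < 0\} \\
&= q^{\leftarrow}_{k+1}(i).
\end{aligned}$$

• Case 2: $q = (k-1)m - 1$.

$$\begin{aligned}
q^{\rightarrow}_{k+1}(i) &= r_i + \gamma v_k(k + (q+1)\ell + p) \\
&= r_i + \gamma \cdot 0 && \{(7.e)\} \\
&= -2\,\frac{\gamma - \gamma^{k+1+q\ell+p}}{1-\gamma}\,\epsilon \\
&= -2\epsilon\left(\frac{\gamma - \gamma^{k+q\ell+p}}{1-\gamma} + \gamma^{k+q\ell+p}\right) \\
&\le -2\gamma^{k+(k-1)\ell m-\ell+p}\epsilon && \{q = (k-1)m - 1\} \\
&< -\gamma^{k+(k-1)\ell m}\epsilon = -\gamma^{(k-1)(\ell m+1)+1}\epsilon && \{p - \ell < 0\} \\
&= \gamma v_k(k + q\ell + p) && \{(7.d)\} \\
&= q^{\leftarrow}_{k+1}(i).
\end{aligned}$$

• Case 3: $q = (k-1)m$.

$$\begin{aligned}
q^{\rightarrow}_{k+1}(i) &= r_i + \gamma v_k(k + ((k-1)m+1)\ell + p) \\
&= r_i + \gamma \cdot 0 && \{(7.f)\} \\
&= r_i + \gamma v_k(k + (k-1)m\ell + p) && \{(7.e)\} \\
&= r_i + \gamma v_k(i-1) \\
&< q^{\leftarrow}_{k+1}(i). && \{r_i < 0\}
\end{aligned}$$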

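D: In states $i = k + 1 + (qm + p + 1)\ell$. In these states, we have:

$$\begin{aligned}
q^{\leftarrow}_{k+1}(i) &= \gamma v_k(k + (qm+p+1)\ell)\\
q^{\rightarrow}_{k+1}(i) &= r_i + \gamma v_k(k + 1 + (qm+p+1)\ell + \ell - 1)\\
&= r_i + \gamma v_k(k + (qm+p+2)\ell). \qquad (8)
\end{aligned}$$

As for the right-hand side of (8) we need to consider two cases: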

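• Case 1: $p + 1 < m$:
In the following, define

$$x_{k,q} = \sum_{j=1}^{k-q-1}\gamma^{j(\ell m+1)}\left(\frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k-q-j} + \epsilon\right).$$

Then,

$$\begin{aligned}
q^{\rightarrow}_{k+1}(i) &= r_i + \gamma v_k(k + (qm + (p+1) + 1)\ell)\\
&= r_i + \gamma^{q(\ell m+1)+1}\left(\frac{\gamma^{\ell(p+2)}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k-q} + \sum_{j=1}^{k-q-1}\gamma^{j(\ell m+1)}\left(\frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k-q-j} + \epsilon\right)\right) && \{(7.b)\}\\
&= r_i + \gamma^{q(\ell m+1)+1}\left(\left(\frac{\gamma^{\ell(p+1)}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}} - \gamma^{\ell(p+1)}\right) r_{k-q} + x_{k,q}\right)\\
&= r_i - \gamma^{(qm+p+1)\ell+q+1}\, r_{k-q} + \gamma^{q(\ell m+1)+1}\left(\frac{\gamma^{\ell(p+1)}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k-q} + x_{k,q}\right)\\
&= r_i - \gamma^{i-k+q}\, r_{k-q} + \gamma v_k(k + (qm+p+1)\ell) - \mathbb{1}_{[p=0]}\gamma^{q(\ell m+1)+1}\epsilon && \{(7.b)\}\\
&\le r_i - \gamma^{i-k+q}\, r_{k-q} + \gamma v_k(k + (qm+p+1)\ell)\\
&= r_i - \gamma^{i-k+q}\, r_{k-q} + q^{\leftarrow}_{k+1}(i). \qquad (9)
\end{aligned}$$

Now, observe that

$$\begin{aligned}
\gamma^{i-k+q}\, r_{k-q} &= -2\gamma^{i-k+q}\,\frac{\gamma - \gamma^{k-q}}{1-\gamma}\,\epsilon = -2\,\frac{\gamma^{i-k+q+1} - \gamma^{i}}{1-\gamma}\,\epsilon\\
&= -2\,\frac{\gamma - \gamma^{i}}{1-\gamma}\,\epsilon - 2\,\frac{\gamma^{i-k+q+1} - \gamma}{1-\gamma}\,\epsilon\\
&= r_i - r_{i-k+q+1}.
\end{aligned}$$

Plugging this back into (9), we get:

$$\begin{aligned}
q^{\rightarrow}_{k+1}(i) &\le r_i - r_i + r_{i-k+q+1} + q^{\leftarrow}_{k+1}(i)\\
&\le q^{\leftarrow}_{k+1}(i). && \{r_{i-k+q+1} \le 0\}
\end{aligned}$$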
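
The identity $\gamma^{i-k+q} r_{k-q} = r_i - r_{i-k+q+1}$ is the step that lets (9) collapse, so a quick numerical check may be useful. The sketch assumes the closed form $r_i = -2\frac{\gamma-\gamma^{i}}{1-\gamma}\epsilon$, which is equivalent to the sum form quoted in part A; the helper names and the sample values of $(i, k, q)$ are hypothetical.

```python
def reward(i, gamma, eps):
    """Assumed closed form r_i = -2 * (gamma - gamma^i) / (1 - gamma) * eps."""
    return -2.0 * eps * (gamma - gamma ** i) / (1.0 - gamma)


def identity_gap(i, k, q, gamma, eps):
    """Left-hand side minus right-hand side of
    gamma^(i-k+q) * r_{k-q} = r_i - r_{i-k+q+1}."""
    lhs = gamma ** (i - k + q) * reward(k - q, gamma, eps)
    rhs = reward(i, gamma, eps) - reward(i - k + q + 1, gamma, eps)
    return lhs - rhs


if __name__ == "__main__":
    # Arbitrary sample indices with k - q >= 1 and i > k.
    for i, k, q in [(20, 5, 1), (12, 4, 0), (30, 7, 3)]:
        print(i, k, q, identity_gap(i, k, q, gamma=0.9, eps=0.5))
```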

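• Case 2: $p + 1 = m$:

Using the fact that $p + 1 = m$ implies $\frac{\gamma^{\ell(p+1)}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}} = \gamma^{\ell m}$, we have:

$$\begin{aligned}
q^{\rightarrow}_{k+1}(i) &= r_i + \gamma v_k(k + ((q+1)m + 1)\ell)\\
&= r_i + \gamma^{(q+1)(\ell m+1)+1}\left(\frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k-q-1} + \epsilon + \sum_{j=1}^{k-q-2}\gamma^{j(\ell m+1)}\left(\frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k-q-1-j} + \epsilon\right)\right) && \{(7.b)\}\\
&= r_i + \gamma^{q(\ell m+1)+1}\sum_{j=1}^{k-q-1}\gamma^{j(\ell m+1)}\left(\frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k-q-j} + \epsilon\right)\\
&= r_i + \gamma^{q(\ell m+1)+1}\left(\left(\frac{\gamma^{\ell(p+1)}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}} - \gamma^{\ell m}\right) r_{k-q} + \sum_{j=1}^{k-q-1}\gamma^{j(\ell m+1)}\left(\frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k-q-j} + \epsilon\right)\right)\\
&= r_i - \gamma^{q(\ell m+1)+1}\gamma^{\ell m}\, r_{k-q} + \gamma\left(v_k(k + (qm+p+1)\ell) - \mathbb{1}_{[p=0]}\gamma^{q(\ell m+1)}\epsilon\right) && \{(7.b)\}\\
&\le r_i - \gamma^{i-k+q}\, r_{k-q} + \gamma v_k(k + (qm+p+1)\ell)
\end{aligned}$$

where we concluded by observing that this is the same result as (9).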

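E: In state $i = k + ((k-1)m + 1)\ell + 1$

$$\begin{aligned}
q^{\leftarrow}_{k+1}(i) &= \gamma v_k(i-1) = \gamma v_k(k + ((k-1)m+1)\ell)\\
&= \gamma^{(k-1)(\ell m+1)+1}\left(\frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_1 + \epsilon\right) && \{\text{(7.b) with } q = k-1 \text{ and } p = 0\}\\
&= \gamma^{(k-1)(\ell m+1)+1}\epsilon && \{r_1 = 0\}\\
&> r_i && \{r_i < 0\}\\
&= r_i + \gamma v_k(i+\ell-1) && \{v_k(i+\ell-1) = 0 \text{ by (7.f)}\}\\
&= q^{\rightarrow}_{k+1}(i).
\end{aligned}$$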

F: In states $i > k + ((k-1)m + 1)\ell + 1$. Following (7.f) we have $v_k(i-1) = v_k(i+\ell-1) = 0$ and hence

$$q^{\leftarrow}_{k+1}(i) = 0 > r_i = q^{\rightarrow}_{k+1}(i).$$

5.2.2 The value function $v_{k+1}$

In the following we will show that the value function $v_{k+1}$ satisfies Lemma 6. To that end we consider
the value of $((T_{k+1,\ell})^m T_{k+1} v_k)(s_0)$ by analysing the trajectories obtained by first following $m$ times $\pi_{k+1,\ell}$
then $\pi_{k+1}$ from various starting states $s_0$.

Given a starting state $s_0$ and a non-stationary policy $\pi_{k+1,\ell}$, we will represent the trajectories as a
sequence of triples $(s_i, a_i, r(s_i, a_i))_{i=0,\dots,\ell m}$ arranged in a "trajectory matrix" of $\ell$ columns and $m$ rows.
Each column corresponds to one of the policies $\pi_{k+1}, \pi_k, \dots, \pi_{k+2-\ell}$. In a column labeled by policy $\pi_j$
the entries are of the form $(s_i, \pi_j(s_i), r(s_i, \pi_j(s_i)))$; this layout makes clear which stationary policy is used
to select the action in any particular step in the trajectory. Indeed, in column $\pi_j$, we have $(s_i, \rightarrow, r_j)$
if and only if $s_i = j$, otherwise each entry is of the form $(s_i, \leftarrow, 0)$. Such a matrix accounts for the
first $m$ applications of the operator $T_{k+1,\ell}$. One additional row of only one triple $(s_i, \pi_{k+1}(s_i), r_{\pi_{k+1}}(s_i))$
represents the final application of $T_{k+1}$. After this triple comes the end state of the trajectory $s_{\ell m+1}$.

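$$\begin{array}{lll}
\pi_4 & \pi_3 & \pi_2\\
(10, \leftarrow, 0) & (9, \leftarrow, 0) & (8, \leftarrow, 0)\\
(7, \leftarrow, 0) & (6, \leftarrow, 0) & (5, \leftarrow, 0)\\
(4, \rightarrow, r_4) & (6, \leftarrow, 0) & (5, \leftarrow, 0)\\
(4, \rightarrow, r_4) & (6, \leftarrow, 0) & (5, \leftarrow, 0)\\
(4, \rightarrow, r_4) & \text{end state } 6 &
\end{array}$$

(Each of the first $m = 4$ rows spans $\ell = 3$ steps; the last row is the final application of $T_4$.)

Figure 5: The trajectory matrix of policy $\pi_{4,\ell}$ starting from state 10 with $m = 4$ and $\ell = 3$.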

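Example 2. Figure 5 depicts the trajectory matrix of policy $\pi_{4,\ell} = \pi_4\pi_3\pi_2$ with $m = 4$ and $\ell = 3$. The
trajectory starts from state $s_0 = 10$ and ends in state $s_{\ell m+1} = 6$. The $\leftarrow$ action is always taken with
reward 0 except when in state 4 under the policy $\pi_4$. From this matrix we can deduce that, for any value
function $v$:

$$\begin{aligned}
((T_{4,\ell})^m T_4 v)(10) &= \gamma^{6} r_4 + \gamma^{9} r_4 + \gamma^{12} r_4 + \gamma^{13} v(6)\\
&= \gamma^{2\ell} r_4 + \gamma^{3\ell} r_4 + \gamma^{4\ell} r_4 + \gamma^{\ell m+1} v(6)\\
&= \frac{\gamma^{2\ell} - \gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_4 + \gamma^{\ell m+1} v(6).
\end{aligned}$$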
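
The trajectory of Example 2 can be replayed mechanically. The sketch below is only an illustration under the conventions stated in the text: policy $\pi_j$ takes $\rightarrow$ exactly in state $j$ (moving to $j + \ell - 1$) and $\leftarrow$ elsewhere (moving one state down). Clamping at state 1 and the function name `simulate` are assumptions made for the example.

```python
def simulate(start, first_policy, ell, m):
    """Follow pi_{k+1,ell} m times, then pi_{k+1} once, in the deterministic chain.

    `first_policy` is the index k+1 of the most recent policy. Returns the list of
    (time step, policy index) pairs at which a -> action (non-zero reward) is taken,
    together with the end state of the trajectory.
    """
    state = start
    reward_steps = []
    for t in range(ell * m + 1):
        j = first_policy - (t % ell)  # column policy used at step t: pi_{k+1}, pi_k, ...
        if state == j:
            reward_steps.append((t, j))   # reward r_j is discounted by gamma^t
            state = state + ell - 1       # -> action
        else:
            state = max(state - 1, 1)     # <- action (state 1 assumed absorbing)
    return reward_steps, state


if __name__ == "__main__":
    steps, end = simulate(start=10, first_policy=4, ell=3, m=4)
    print(steps, end)
```

On the data of Example 2, the rewards are collected at steps 6, 9 and 12 and the trajectory ends in state 6 after $\ell m + 1 = 13$ steps, which matches the exponents in the expansion above.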

With this in hand, we are going to prove each case of Lemma 6 for $v_{k+1}$.

In states $i < k + 1$. Following $m$ times $\pi_{k+1,\ell}$ and then $\pi_{k+1}$ starting from these states consists in
taking the $\leftarrow$ action $\ell m + 1$ times to eventually finish either in state 1 if $i < \ell m + 2$ with value

$$v_{k+1}(i) = \gamma^{\ell m+1} v_k(1) + \epsilon_{k+1}(i) = -\gamma^{\ell m+1}\gamma^{(k-1)(\ell m+1)}\epsilon = -\gamma^{k(\ell m+1)}\epsilon$$

or otherwise in state $i - \ell m - 1 < k$ with value

$$v_{k+1}(i) = \gamma^{\ell m+1} v_k(i - \ell m - 1) + \epsilon_{k+1}(i) = -\gamma^{\ell m+1}\gamma^{(k-1)(\ell m+1)}\epsilon = -\gamma^{k(\ell m+1)}\epsilon.$$

This matches Equation (7.a) in both cases.

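In states $i = k + 1 + (qm + p + 1)\ell$. Consider the states $i = k + 1 + (qm + p + 1)\ell$ with $q \ge 0$ and
$0 \le p < m$. Following $m$ times $\pi_{k+1,\ell}$ and then $\pi_{k+1}$ starting from state $i$ gives the following trajectories:

• when $q = 0$ (i.e. $i = k + 1 + (p+1)\ell$):

$$\begin{array}{llll}
\pi_{k+1} & \pi_k & \cdots & \pi_{k-\ell+2}\\
(k+1+(p+1)\ell, \leftarrow, 0) & (k+(p+1)\ell, \leftarrow, 0) & \cdots & (k+p\ell+2, \leftarrow, 0)\\
\vdots & \vdots & & \vdots\\
(k+1+\ell, \leftarrow, 0) & (k+\ell, \leftarrow, 0) & \cdots & (k+2, \leftarrow, 0)\\
(k+1, \rightarrow, r_{k+1}) & (k+\ell, \leftarrow, 0) & \cdots & (k+2, \leftarrow, 0)\\
\vdots & \vdots & & \vdots\\
(k+1, \rightarrow, r_{k+1}) & (k+\ell, \leftarrow, 0) & \cdots & (k+2, \leftarrow, 0)\\
(k+1, \rightarrow, r_{k+1}) & \text{end state } k+\ell & &
\end{array}$$

(Each row spans $\ell$ steps; the first $p + 1$ rows only take the $\leftarrow$ action, and the next $m - p - 1$ rows start with the $\rightarrow$ action in state $k+1$.)

Using (7.b) with $q = p = 0$ as our induction hypothesis, this gives

$$\begin{aligned}
((T_{k+1,\ell})^m T_{k+1} v_k)(i) &= \sum_{j=p+1}^{m}\gamma^{\ell j} r_{k+1} + \gamma^{\ell m+1}\left(\frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_k + \epsilon + \sum_{j=1}^{k-1}\gamma^{j(\ell m+1)}\left(\frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k-j} + \epsilon\right)\right)\\
&= \frac{\gamma^{\ell(p+1)}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k+1} + \sum_{j=1}^{k}\gamma^{j(\ell m+1)}\left(\frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k+1-j} + \epsilon\right).
\end{aligned}$$

Accounting for the error term and the fact that $i = k + 1 + \ell \Leftrightarrow p = q = 0$, we get

$$\begin{aligned}
v_{k+1}(i) &= ((T_{k+1,\ell})^m T_{k+1} v_k)(i) + \mathbb{1}_{[i=k+1+\ell]}\epsilon\\
&= \frac{\gamma^{\ell(p+1)}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k+1} + \mathbb{1}_{[p=0]}\epsilon + \sum_{j=1}^{k}\gamma^{j(\ell m+1)}\left(\frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k+1-j} + \epsilon\right)
\end{aligned}$$

which is (7.b) for $k + 1$ and $q = 0$ as desired.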

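• when $1 \le q \le k$:

In this case we have $i - (\ell m + 1) \ge k + 1$, meaning that $k + 1$, the first state where the $\rightarrow$ action
would be available, is unreachable (in the sense that the trajectory could end in $k + 1$, but no action
will be taken there). Consequently the $\leftarrow$ action is taken $\ell m + 1$ times and the system ends in state
$i - \ell m - 1 = k + ((q-1)m + p + 1)\ell$. Therefore, using (7.b) as induction hypothesis and the fact that
$i \notin \{k+1, k+\ell+1\} \Rightarrow \epsilon_{k+1}(i) = 0$, we have:

$$\begin{aligned}
v_{k+1}(i) &= \gamma^{\ell m+1} v_k(k + ((q-1)m + p + 1)\ell) + \epsilon_{k+1}(i)\\
&= \gamma^{q(\ell m+1)}\left(\frac{\gamma^{\ell(p+1)}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k+1-q} + \mathbb{1}_{[p=0]}\epsilon + \sum_{j=1}^{k-q}\gamma^{j(\ell m+1)}\left(\frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k+1-q-j} + \epsilon\right)\right)
\end{aligned}$$

which satisfies (7.b) for $k + 1$.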
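In state $k+1$. Following $m$ times $\pi_{k+1,\ell}$ and then $\pi_{k+1}$ starting from $k+1$ gives the following trajectory:

$$\begin{array}{llll}
\pi_{k+1} & \pi_k & \cdots & \pi_{k-\ell+2}\\
(k+1, \rightarrow, r_{k+1}) & (k+\ell, \leftarrow, 0) & \cdots & (k+2, \leftarrow, 0)\\
\vdots & \vdots & & \vdots\\
(k+1, \rightarrow, r_{k+1}) & (k+\ell, \leftarrow, 0) & \cdots & (k+2, \leftarrow, 0)\\
(k+1, \rightarrow, r_{k+1}) & \text{end state } k+\ell & &
\end{array}$$

(The matrix has $m$ rows of $\ell$ steps each, followed by the final single step.)

As a consequence, with (7.c) as induction hypothesis we have:

$$\begin{aligned}
((T_{k+1,\ell})^m T_{k+1} v_k)(k+1) &= \frac{1-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k+1} + \gamma^{\ell m+1} v_k(k+\ell)\\
&= r_{k+1} + \frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k+1} + \gamma^{\ell m+1}\left(\frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_k + \epsilon + \sum_{j=1}^{k-1}\gamma^{j(\ell m+1)}\left(\frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k-j} + \epsilon\right)\right)\\
&= r_{k+1} + \frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k+1} + \sum_{j=1}^{k}\gamma^{j(\ell m+1)}\left(\frac{\gamma^{\ell}-\gamma^{\ell(m+1)}}{1-\gamma^{\ell}}\, r_{k+1-j} + \epsilon\right)\\
&= r_{k+1} + v_{k+1}(k+\ell+1) - \epsilon.
\end{aligned}$$

Hence,

$$v_{k+1}(k+1) = ((T_{k+1,\ell})^m T_{k+1} v_k)(k+1) + \epsilon_{k+1}(k+1)$$

which matches (7.c).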



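In states $i = k + 1 + q\ell + p$. For states $i = k + 1 + q\ell + p$ with $0 \le q \le km - 1$ and $1 \le p < \ell$, the
policy $\pi_{k+1,\ell}$ always takes the $\leftarrow$ action with either one of the following trajectories

• when $q \ge m$:

$$\begin{array}{llll}
\pi_{k+1} & \pi_k & \cdots & \pi_{k-\ell+2}\\
(k+1+q\ell+p, \leftarrow, 0) & (k+q\ell+p, \leftarrow, 0) & \cdots & (k+(q-1)\ell+p+2, \leftarrow, 0)\\
\vdots & \vdots & & \vdots\\
(k+1+(q-m+1)\ell+p, \leftarrow, 0) & (k+(q-m+1)\ell+p, \leftarrow, 0) & \cdots & (k+(q-m)\ell+p+2, \leftarrow, 0)\\
(k+1+(q-m)\ell+p, \leftarrow, 0) & \text{end state } k+(q-m)\ell+p & &
\end{array}$$

(The matrix has $m$ rows of $\ell$ steps each, followed by the final single step.)

As a consequence, with (7.d) as induction hypothesis we have:

$$v_{k+1}(i) = ((T_{k+1,\ell})^m T_{k+1} v_k)(i) = \gamma^{\ell m+1} v_k(k + (q-m)\ell + p) = -\gamma^{\ell m+1}\gamma^{(k-1)(\ell m+1)}\epsilon = -\gamma^{k(\ell m+1)}\epsilon$$

which satisfies (7.d) in this case.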

• when $q < m$:

Assuming that negative states correspond to state 1, where the action is irrelevant, we have the
following trajectory:

$$\begin{array}{lll}
\pi_{k+1} & \cdots & \pi_{k-\ell+2}\\
(k+1+q\ell+p, \leftarrow, 0) & \cdots & (k+(q-1)\ell+p+2, \leftarrow, 0)\\
\vdots & & \vdots\\
(k+1+\ell+p, \leftarrow, 0) & \cdots & (k+p+2, \leftarrow, 0)\\
(k+1+p, \leftarrow, 0) & \cdots & (k-\ell+p+2, \leftarrow, 0)\\
(k+1-\ell+p, \leftarrow, 0) & \cdots & (k-2\ell+p+2, \leftarrow, 0)\\
\vdots & & \vdots\\
(k+1-(m-q-1)\ell+p, \leftarrow, 0) & \cdots & (k-(m-q)\ell+p+2, \leftarrow, 0)\\
(k+1-(m-q)\ell+p, \leftarrow, 0) & \text{end state } k+(q-m)\ell+p &
\end{array}$$

(Each row spans $\ell$ steps; the first $q$ rows start above state $k+1+p$, and the next $m - q$ rows start at or below it.)

In the above trajectory, one can see that only the $\leftarrow$ action is taken (ignoring state 1). Indeed, since
we follow the policies $\pi_{k+1}, \pi_k, \dots, \pi_{k-\ell+2}$, the $\rightarrow$ action may only be taken in states $k+1, k, \dots, k-\ell+2$.
When state $k+1$ is reached, the selected action is $\pi_{k-p+1}(k+1)$, which is $\leftarrow$ since $p \ge 1$. The same
reasoning applies in the next states $k, \dots, k-\ell+1$, where $p \ge 1$ prevents the use of a policy that would
select the $\rightarrow$ action in those states.

Since $p - \ell < 0$ the trajectory always terminates in a state $j < k$ with value $v_k(j) = -\gamma^{(k-1)(\ell m+1)}\epsilon$
as for the $q \ge m$ case, which allows us to conclude that (7.d) also holds in this case.



In states $i = k + 1 + km\ell + p$. Observe that following $m$ times $\pi_{k+1,\ell}$ and then $\pi_{k+1}$ once amounts to
always taking $\leftarrow$ actions. Thus, one eventually finishes in state $k + (k-1)m\ell + p \ge k + 1$, which, since
$\epsilon_{k+1}(i) = 0$, gives

$$v_{k+1}(i) = ((T_{k+1,\ell})^m T_{k+1} v_k)(i) = \gamma^{\ell m+1} v_k(k + (k-1)m\ell + p) = \gamma^{\ell m+1}\cdot 0 = 0,$$

satisfying (7.e).

In states $i > k + 1 + (km + 1)\ell$. In these states, the action $\leftarrow$ is taken $\ell m + 1$ times ending up in state
$j > k + ((k-1)m + 1)\ell$, with value $v_k(j) = 0$, from which $v_{k+1}(i) = 0$ follows as required by (7.f).



6 Empirical Illustration 

In this last section, we describe an empirical illustration of our new variation of MPI on the dynamic
location problem from Bertsekas and Yu (2012). The problem involves a repairman moving between
$n$ sites according to some transition probabilities. To allow him to do his work, a trailer containing
supplies for the repair jobs can be relocated to any of the sites at each decision epoch. The problem
consists in finding a relocation policy for the trailer, according to the repairman's and trailer's positions,
which maximizes the discounted expectation of a reward function.

Given $n$ sites, the state space has $n^2$ states comprising the locations of both the repairman and the
trailer. There are $n$ actions, each one corresponding to a possible destination of the trailer. Given an
action $a = 1, \dots, n$, and a state $s = (s_r, s_t)$, where the repairman and the trailer are at locations $s_r$ and
$s_t$, respectively, we define the reward as $r(s, a) = -|s_r - s_t| - |s_t - a|/2$. At any time-step the repairman
moves from his location $s_r < n$ with uniform probability to any location $s_r \le s'_r \le n$; when $s_r = n$, he
moves to site 1 with probability 0.75 or otherwise stays. Since the trailer moves are deterministic, the
transition function is



$$T((s_r, s_t), a, (s'_r, a)) = \begin{cases} \frac{1}{n - s_r + 1} & \text{if } s_r < n\\ 0.75 & \text{if } s_r = n \wedge s'_r = 1\\ 0.25 & \text{if } s_r = n \wedge s'_r = n \end{cases}$$

and 0 everywhere else.

Figure 6: Average error of policy $\pi_{k,\ell}$ per iteration $k$ of MPI. Red lines for $\ell = 1$, yellow for $\ell = 2$, green
for $\ell = 5$ and blue for $\ell = 10$.

Figure 7: Policy error and standard deviation after 150 iterations for different values of $\ell$. Each
plot represents a fixed value of the product $\ell m$ (panels: $\ell m = 10$ and $\ell m = 100$). Data is collected over 250 runs with $n = 8$.
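
The dynamic location model described in this section is small enough to write down directly. The sketch below is one possible encoding, assuming sites are numbered $1, \dots, n$ with $n \ge 2$ and reading "uniform probability to any location $s_r \le s'_r \le n$" as probability $1/(n - s_r + 1)$ for each such site; the function names `transition` and `reward` are chosen for illustration.

```python
def transition(n, s, a):
    """Return {next_state: probability} for state s = (s_r, s_t) and action a.

    The trailer moves deterministically to site a; the repairman moves at random.
    Assumes n >= 2 and sites numbered 1..n.
    """
    s_r, _ = s
    if s_r < n:
        # Uniform over the sites s_r, s_r + 1, ..., n (assumed inclusive on both ends).
        p = 1.0 / (n - s_r + 1)
        return {(nxt, a): p for nxt in range(s_r, n + 1)}
    # At the last site: back to site 1 with probability 0.75, otherwise stay.
    return {(1, a): 0.75, (n, a): 0.25}


def reward(s, a):
    """Reward r(s, a) = -|s_r - s_t| - |s_t - a| / 2."""
    s_r, s_t = s
    return -abs(s_r - s_t) - abs(s_t - a) / 2


if __name__ == "__main__":
    n = 8
    print(transition(n, (6, 2), 3))
    print(reward((3, 5), 2))
```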

We evaluated the empirical performance gain of using non-stationary policies by implementing the
algorithm using random error vectors $\epsilon_k$, with each component being uniformly random between 0
and some user-supplied value $\epsilon$. The adjustable size (with $n$) of the state and action spaces allowed us to
compute an optimal policy to compare with the approximate ones generated by MPI for all combinations
of parameters $\ell \in \{1, 2, 5, 10\}$ and $m \in \{1, 2, 5, 10, 25, \infty\}$. Recall that the cases $m = 1$ and $m = \infty$
correspond respectively to the non-stationary variants of VI and PI of Scherrer and Lesner (2012), while
the case $\ell = 1$ corresponds to the standard MPI algorithm. We used $n = 8$ locations, $\gamma = 0.98$ and $\epsilon = 4$
in all experiments.
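
To make the experimental setup concrete, here is a minimal sketch of one way such a noisy non-stationary MPI loop could be written for a finite MDP and finite $m$. It follows the update $v_{k+1} = (T_{k+1,\ell})^m T_{k+1} v_k + \epsilon_{k+1}$ analysed in Section 5.2.2, with $\pi_{k+1}$ taken greedy with respect to $v_k$. The function names, the dictionary-based MDP interface (`transition(s, a)` returning `{next_state: probability}`) and the choice to cycle only through the policies generated so far during the first iterations are assumptions of this sketch, not the authors' implementation.

```python
import random


def backup(s, a, v, transition, reward, gamma):
    """One-step lookahead r(s, a) + gamma * E[v(s')]."""
    return reward(s, a) + gamma * sum(p * v[s2] for s2, p in transition(s, a).items())


def apply_policy_operator(policy, v, states, transition, reward, gamma):
    """Apply the Bellman operator of a stationary policy to the value table v."""
    return {s: backup(s, policy[s], v, transition, reward, gamma) for s in states}


def greedy(v, states, actions, transition, reward, gamma):
    """Policy that is greedy with respect to v."""
    return {s: max(actions, key=lambda a: backup(s, a, v, transition, reward, gamma))
            for s in states}


def ns_mpi(states, actions, transition, reward, gamma, ell, m, iterations, eps, seed=0):
    """Noisy non-stationary MPI sketch; returns the last (at most ell) policies and v."""
    rng = random.Random(seed)
    v = {s: 0.0 for s in states}
    policies = []  # most recent policy first
    for _ in range(iterations):
        policies.insert(0, greedy(v, states, actions, transition, reward, gamma))
        del policies[ell:]  # keep only the ell most recent policies
        # (T_{k+1,ell})^m T_{k+1} v_k: the rightmost operator acts first.
        v = apply_policy_operator(policies[0], v, states, transition, reward, gamma)
        for _ in range(m):
            for pi in reversed(policies):  # oldest policy first, newest last
                v = apply_policy_operator(pi, v, states, transition, reward, gamma)
        # Error vector: each component uniform between 0 and eps.
        v = {s: x + rng.uniform(0.0, eps) for s, x in v.items()}
    return policies, v
```

Once $\ell$ policies are available, each iteration applies $\ell m + 1$ stationary Bellman operators, which is the cost measure used for the fixed-budget comparison. The case $m = \infty$ would need an exact evaluation of the periodic policy instead of the inner loop and is not covered here.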



Figure 6 shows the average value of the error $v_* - v_{\pi_{k,\ell}}$ per iteration for the different values of
parameters $m$ and $\ell$. For each parameter combination, the results are obtained by averaging over 250
runs. While higher values of $\ell$ impact computational efficiency (by a factor $O(\ell)$), they always result in
better performance. Especially with the lower values of $m$, a higher $\ell$ allows for faster convergence. As
$m$ increases, this trend fades and is finally reversed in favor of faster convergence for small $\ell$. However,
while small $\ell$ converges faster, it does so with greater error than higher $\ell$ after convergence. It can be
seen that convergence is attained shortly after the $\ell$-th iteration, which can be explained by the fact that
the first policies (involving $\pi_0, \pi_{-1}, \dots, \pi_{-\ell+2}$) are of poor quality and the algorithm must perform at
least $\ell$ iterations to "push them out" of $\pi_{k,\ell}$.

We conducted a second experiment to study the relative influence of the parameters $\ell$ and $m$. From
the observation that the time complexity of an iteration of MPI can be roughly summarized by the
number $\ell m + 1$ of applications of a stationary policy's Bellman operator, we ran the algorithm for fixed
values of the product $\ell m$ and measured the policy error for varying values of $\ell$ after 150 iterations. These
results are depicted in Figure 7. This setting gives insight on how to set both parameters for a given
"time budget" $\ell m$. While runs with a lower $\ell$ are slightly faster to converge, higher values always give
the best policies after a sufficient number of iterations. Favoring $\ell$ over $m$ appears to
always be a good approach, since it also greatly reduces the variance across all runs, showing that
non-stationarity adds robustness to the approximation noise.